comparison result
f3bfbd65743e60c685a3845bd61ce15f-Supplemental-Conference.pdf
L-CAD: Language-based Colorization
The tricycle on the left is red, and the tricycle on the right is orange. We leverage a referring segmentation model to roughly estimate the contours of the objects mentioned in the description, which enables our instance-aware sampling strategy. To test the robustness of our model, we manually annotate a sequence of contours ranging from coarse to fine and visualize the corresponding colorization results. As shown in Figure 8, our model shows a remarkable ability to produce condition-consistent colorization results even with imprecise contours. This is because sampling is performed in the latent space using downsampled contours, and the compression decoder in the pixel space can adaptively fix color-bleeding issues.
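The robustness argument above can be illustrated with a small sketch. It assumes a downsampling factor of 4 and a majority-vote pooling rule, neither of which is stated in the excerpt, and the helper name `downsample_mask` is hypothetical. It only shows that a precise and a slightly imprecise contour mask can collapse to the same coarse mask once downsampled to a lower resolution.

```python
def downsample_mask(mask, factor):
    """Downsample a binary mask by majority vote over factor x factor blocks.

    The pooling rule and the factor are illustrative assumptions,
    not the procedure used in the paper.
    """
    height, width = len(mask), len(mask[0])
    coarse = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [
                mask[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            ]
            # A coarse cell is "on" when at least half of its pixels are on.
            row.append(1 if 2 * sum(block) >= len(block) else 0)
        coarse.append(row)
    return coarse


# A precise 4x4 object mask in the top-left corner of an 8x8 image.
precise = [[1 if y < 4 and x < 4 else 0 for x in range(8)] for y in range(8)]

# An imprecise annotation: the contour spills one pixel to the right on three rows.
rough = [row[:] for row in precise]
for y in range(3):
    rough[y][4] = 1

coarse_precise = downsample_mask(precise, 4)
coarse_rough = downsample_mask(rough, 4)
```

In this toy case the two annotations differ in pixel space but agree after downsampling, which is consistent with the explanation given for Figure 8.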
Appendix
A About Equation (1)
As we discussed in Section 3, label smoothing and focal loss are equivalent to the standard CE loss with an additional maximum-entropy regularizer (see Equations (1) and (2) in the main text). The proof of Equation (2) can be found in the corresponding paper [4].
SVHN is an image dataset consisting of 32 × 32 color images of the digits 0 to 9. CIFAR-10 and CIFAR-100 consist of 32 × 32 color natural images arranged in 10 and 100 classes, respectively. For 20Newsgroups, we use the GloVe word embedding [7] for text representation before the 1D-CNN model and set the embedding dimension to 100.
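Equation (1) itself is not reproduced in this excerpt, so the following is only a sketch under the common assumption of uniform label smoothing with weight `eps`. It numerically checks the standard identity that the label-smoothed CE loss equals `(1 - eps)` times the ordinary CE loss plus `eps` times `log K + KL(u || p)`, where `u` is the uniform distribution over `K` classes. The KL term penalizes over-confident predictions, which is the sense in which it acts as an entropy-encouraging regularizer. The logits and `eps` value are arbitrary example inputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def kl_from_uniform(probs):
    """KL(u || p), where u is the uniform distribution over the classes."""
    k = len(probs)
    return sum((1.0 / k) * math.log((1.0 / k) / p) for p in probs)


# Arbitrary example inputs (assumptions for illustration only).
logits = [2.0, 0.5, -1.0]
label = 0
eps = 0.1

probs = softmax(logits)
num_classes = len(probs)
one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]
smoothed = [(1 - eps) * t + eps / num_classes for t in one_hot]

# Left-hand side: CE against the smoothed target.
label_smoothing_loss = cross_entropy(smoothed, probs)

# Right-hand side: standard CE plus the uniform-prior regularizer.
decomposed_loss = (1 - eps) * cross_entropy(one_hot, probs) + eps * (
    math.log(num_classes) + kl_from_uniform(probs)
)
```

The two quantities agree up to floating-point error for any logits, since the smoothed target is a convex combination of the one-hot and uniform distributions and cross-entropy is linear in the target.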
Rebuttal for "Revisiting the Evaluation of Image Synthesis with GANs"
Anonymous Author(s)
Our presentation is organized this way for the following reasons: In Section 2.3, we present the details of the generative models, the evaluated datasets, and the analysis approaches (including our visualization tool, the histogram matching attack, and human evaluation). They are independent of each other, so we discuss them in parallel in the main paper. In Section 3.1, we investigate the feature extractors by first identifying their attention on visual semantics and then examining their robustness to the histogram matching attack. Finally, we filter out extractors that define similar representation spaces. These studies build on one another, so they are organized progressively.
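For readers unfamiliar with the operation behind the histogram matching attack, here is a minimal sketch of classical histogram matching on integer intensity values. It is not the authors' attack implementation, whose details are not given in this excerpt. The function names and the toy inputs are assumptions for illustration. Each source level is mapped to the smallest reference level whose cumulative frequency is at least as large.

```python
def cumulative_counts(values, levels):
    """Return the cumulative count of values at or below each intensity level."""
    counts = [0] * levels
    for v in values:
        counts[v] += 1
    running = 0
    cumulative = []
    for c in counts:
        running += c
        cumulative.append(running)
    return cumulative


def match_histogram(source, reference, levels=256):
    """Remap source intensities so their distribution follows the reference."""
    src_cum = cumulative_counts(source, levels)
    ref_cum = cumulative_counts(reference, levels)
    n_src, n_ref = len(source), len(reference)

    mapping = []
    r = 0
    for s in range(levels):
        # Advance r until the reference CDF reaches the source CDF.
        # Cross-multiplied integers avoid floating-point comparisons.
        while r < levels - 1 and ref_cum[r] * n_src < src_cum[s] * n_ref:
            r += 1
        mapping.append(r)
    return [mapping[v] for v in source]


# Toy flattened "images" with integer intensities.
source_pixels = [0, 0, 1, 1, 2, 2, 3, 3]
reference_pixels = [10, 10, 20, 20, 30, 30, 40, 40]
matched_pixels = match_histogram(source_pixels, reference_pixels)
```

After matching, the source keeps its pixel ordering (spatial structure) but takes on the intensity distribution of the reference, which is why such a transformation can be used to probe how sensitive a feature extractor is to low-level statistics.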
259a5df46308d60f8454bd4adcc3b462-Supplemental-Conference.pdf
As mentioned in the paper, the visual-grounded alignment decoder is adopted to fuse multimodal information with cross-attention blocks, while the visual-grounded generation decoder is applied to generate natural language conditioned on the visual input. Their architectures are shown in Figure 1. Based on these deeply fused representations, we finally generate the predicted answers with the visual-grounded generation decoder. In this section, we describe the settings used when fine-tuning the pretrained models on various downstream tasks. We use RandomAugment [1] for data augmentation. The default settings for fine-tuning on each dataset are shown in Table 1.
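The cross-attention blocks mentioned in this excerpt can be sketched generically as follows. This is a single-head scaled dot-product cross-attention without learned projections, written as an assumption-laden illustration rather than the architecture in Figure 1. The function name, the toy token vectors, and the omission of projection matrices are all choices made for brevity. Queries come from one modality (here, text tokens) and keys and values from the other (here, visual tokens).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention from queries to keys/values.

    Learned query/key/value projections are omitted (an illustrative simplification).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        fused = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(fused)
    return outputs


# Toy inputs: one text token attending over two visual tokens.
text_tokens = [[1.0, 0.0]]
visual_keys = [[0.0, 0.0], [0.0, 0.0]]
visual_values = [[1.0, 2.0], [3.0, 4.0]]

fused_tokens = cross_attention(text_tokens, visual_keys, visual_values)
```

With identical keys the attention weights are uniform, so the fused output is simply the mean of the visual values. In a trained model the keys differ, and each text token draws more heavily on the visual tokens it matches.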